DSC 140B
Problems tagged with vanishing gradients

Problem #152

Tags: vanishing gradients, quiz-07, lecture-14, neural networks

Which of the following best describes the vanishing gradient problem that can occur while training deep neural networks using backpropagation?

Solution

Partial derivatives of parameters in layers far from the output become very small.

During backpropagation, the chain rule multiplies many factors together as the gradient propagates from the output back through each layer. If these factors are small (e.g., because of sigmoid activations whose derivatives are at most \(1/4\)), the product shrinks exponentially with depth. This means that parameters in early layers (far from the output) receive very small gradient updates and learn very slowly.
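The exponential shrinkage can be seen with a short numeric sketch. The values below are illustrative: even in the best case for a sigmoid (input exactly \(0\), where its derivative attains its maximum of \(1/4\)), a chain of per-layer derivative factors decays exponentially with depth.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximized at x = 0, where it equals 1/4

# The gradient reaching an early layer of a depth-d chain includes one
# activation-derivative factor per layer. Even at the maximum value of
# 1/4 per factor, the product shrinks exponentially with depth.
for depth in [1, 5, 10, 20]:
    factor = sigmoid_deriv(0.0) ** depth  # best case: every pre-activation is 0
    print(f"depth {depth:2d}: gradient factor <= {factor:.2e}")
```

At depth 20 the factor is already below \(10^{-12}\), so early-layer parameters barely move even though later layers train normally.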

Problem #155

Tags: vanishing gradients, quiz-07, lecture-14, neural networks

True or False: one solution to the vanishing gradient problem is to increase the learning rate \(\eta\).

Solution

False.

Increasing the learning rate does not fix the vanishing gradient problem. Vanishing gradients arise because, during backpropagation, the chain rule multiplies many small factors together, causing gradients in early layers to shrink exponentially with depth. Increasing \(\eta\) scales all gradient updates equally, so the relative imbalance between layers remains: early-layer gradients are still exponentially smaller than later-layer gradients.
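A toy calculation makes the point concrete. The gradient magnitudes below are hypothetical, chosen only to illustrate the imbalance: scaling \(\eta\) rescales both updates by the same factor, so the ratio between early- and late-layer updates never changes.

```python
# Hypothetical gradient magnitudes for a parameter in an early layer
# vs. a late layer of a deep sigmoid network (illustrative values only).
early_grad = 1e-9
late_grad = 1e-2

for eta in [0.01, 1.0, 100.0]:
    early_update = eta * early_grad
    late_update = eta * late_grad
    # The ratio of the updates is independent of eta:
    print(f"eta={eta:>6}: early/late update ratio = {early_update / late_update:.1e}")
```

An \(\eta\) large enough to give early layers useful updates would make late-layer updates enormous, destabilizing training rather than fixing the imbalance.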

Actual solutions to the vanishing gradient problem include using ReLU activations (whose derivatives are \(0\) or \(1\), rather than always small), skip connections, and batch normalization.
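The ReLU fix can be sketched numerically. Assuming a chain of 20 layers whose pre-activations all happen to be positive (hypothetical values for illustration), the product of sigmoid derivatives collapses while the product of ReLU derivatives stays exactly \(1\).

```python
import math

def sigmoid_deriv(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_deriv(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

# Hypothetical depth-20 chain with positive pre-activations.
pre_activations = [0.5] * 20

sig_factor = math.prod(sigmoid_deriv(x) for x in pre_activations)
relu_factor = math.prod(relu_deriv(x) for x in pre_activations)

print(f"sigmoid chain factor: {sig_factor:.2e}")  # vanishingly small
print(f"ReLU chain factor:    {relu_factor:.2e}")  # exactly 1
```

Because each ReLU factor is \(0\) or \(1\) rather than at most \(1/4\), active paths pass the gradient through undiminished, which is why ReLU networks train much deeper than sigmoid networks.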